Add agent skill for creating new model by tianleiwu · Pull Request #2206 · microsoft/onnxruntime-genai

tianleiwu · 2026-06-09T23:13:50Z

Add a skill for agents to create a new model.

Copilot

Pull request overview

Adds a new agent skill document intended to guide contributors/agents through adding support for new HuggingFace model architectures in the Python model builder and debugging ONNX-vs-PyTorch numerical parity issues.

Changes:

- Introduces .github/skills/new-model/SKILL.md with file orientation, builder dispatch guidance, and an export/debugging workflow.
- Documents MoE/QMoE-related implementation notes and common parity failure modes.

tianleiwu · 2026-06-10T01:06:50Z

+- `make_moe_op` emits `MoE` (fp16) or `QMoE` (int4/int8). `make_qmoe_weights`
+  quantizes and packs each expert weight `[N, K]`.
+- **CUDA QMoE weight encoding (critical):** the kernel is a CUTLASS fpA_intB
+  mixed GEMM that consumes **offline-prepacked** weights. The proven recipe
+  (see `_cutlass_prepacked_blockwise_quantize` in `base.py`):
+  1. transpose weight to `[K, N]`;
+  2. `onnxruntime...quantize_matmul_4bits(qw, w_T, scales, zp, block, N, K, is_symmetric=True)`;
+  3. **keep the SIGNED scales** — do NOT `abs()` them. The kernel dequantizes as
+     `(q - 8) * scale`, and `quantize_matmul_4bits` folds the block-anchor sign


pack_weights_for_cuda_mixed_gemm is available in the onnxruntime-gpu package >= 1.27. It is the preferred way to pack weights (no dependency on tensorrt-llm). See https://github.com/microsoft/onnxruntime/blob/main/docs/contrib_ops/cuda/moe_qmoe.md

Fixed in the latest commit. Removed the references to _cutlass_prepacked_blockwise_quantize, quantize_matmul_4bits, force_arch=80, and the signed-scale claim. Kept pack_weights_for_cuda_mixed_gemm (from onnxruntime-gpu >= 1.27) as the preferred, TRT-LLM-free packing approach, and updated the recipe to reflect the actual _symmetric_blockwise_quantize + pack_weights_for_cuda_mixed_gemm flow.
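The quantization half of that flow can be illustrated with a minimal pure-Python sketch. It assumes the unsigned 4-bit codes with a zero point of 8 implied by the `(q - 8) * scale` formula in the quoted diff, one non-negative scale per block, and zero padding up to a multiple of the block size. The function names below are hypothetical, this is not the code of `_symmetric_blockwise_quantize`, and it does not reproduce the layout produced by `pack_weights_for_cuda_mixed_gemm`.

```python
import math


def symmetric_blockwise_quantize(row, block_size):
    """Quantize one weight row to 4-bit codes with one scale per block.

    Hypothetical illustration only; not the builder's implementation.
    """
    # Pad with zeros so the row length is a multiple of block_size.
    padded_len = math.ceil(len(row) / block_size) * block_size
    padded = list(row) + [0.0] * (padded_len - len(row))

    codes, scales = [], []
    for start in range(0, padded_len, block_size):
        block = padded[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        # Map the largest magnitude to 7 so that q - 8 stays within [-7, 7].
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        for v in block:
            q = round(v / scale) + 8  # shift by the assumed zero point of 8
            codes.append(min(15, max(0, q)))
    return codes, scales


def dequantize(codes, scales, block_size):
    """Invert the quantization: (q - 8) * scale of the block holding q."""
    return [(q - 8) * scales[i // block_size] for i, q in enumerate(codes)]


row = [0.7, -0.35, 0.1, 0.0, 1.4, -1.4]
codes, scales = symmetric_blockwise_quantize(row, block_size=4)
restored = dequantize(codes, scales, block_size=4)
```

With round-to-nearest and no clipping, each restored value differs from the original by at most half of its block's scale, which is a useful sanity check before handing the codes to a packing step.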

tianleiwu · 2026-06-10T01:07:40Z

+- The QMoE node then uses the **default** `weights_prepacked` (omit the attribute;
+  default = prepacked). Do **not** set `weights_prepacked=0` (the raw-weight +
+  runtime-PrePack-hook path is finiteness-checked only and is not bit-correct).
+- **CUDA QMoE only supports `block_size` 64 or 128.** Assert this in the builder.


The block size restriction (64 or 128) is a limitation of the CUDA QMoE op. See https://github.com/microsoft/onnxruntime/blob/main/docs/contrib_ops/cuda/moe_qmoe.md for details.

Fixed — updated the bullet to state that qmoe_block_size supports values 16, 32, 64, 128, or 256 (default 128 for CUDA/TRT-RTX, 32 otherwise), and that _symmetric_blockwise_quantize pads automatically.
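A builder-side check for these values might look like the sketch below. It is only an illustration of the two statements in this thread: the accepted set and defaults come from the reply above, and the extra CUDA restriction to 64 or 128 comes from the reviewer's comment. The function name, the constant names, and the provider strings `"cuda"` and `"trt-rtx"` are assumptions, not identifiers from the model builder.

```python
# Values accepted for qmoe_block_size, as stated in the reply above.
SUPPORTED_QMOE_BLOCK_SIZES = (16, 32, 64, 128, 256)
# Block sizes the CUDA QMoE op handles, per the reviewer's comment.
CUDA_QMOE_BLOCK_SIZES = (64, 128)


def resolve_qmoe_block_size(provider, requested=None):
    """Return a validated block size (hypothetical helper for illustration)."""
    provider = provider.lower()
    if requested is None:
        # Default: 128 for CUDA/TRT-RTX, 32 otherwise.
        return 128 if provider in ("cuda", "trt-rtx") else 32
    if requested not in SUPPORTED_QMOE_BLOCK_SIZES:
        raise ValueError(
            f"qmoe_block_size must be one of {SUPPORTED_QMOE_BLOCK_SIZES}, got {requested}"
        )
    if provider == "cuda" and requested not in CUDA_QMOE_BLOCK_SIZES:
        raise ValueError(
            f"CUDA QMoE requires a block size in {CUDA_QMOE_BLOCK_SIZES}, got {requested}"
        )
    return requested
```

Failing early in the builder, rather than at session creation, keeps the error close to the option the user actually set.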

Copilot · 2026-06-10T01:07:57Z

+- The QMoE node then uses the **default** `weights_prepacked` (omit the attribute;
+  default = prepacked). Do **not** set `weights_prepacked=0` (the raw-weight +
+  runtime-PrePack-hook path is finiteness-checked only and is not bit-correct).


Removed the weights_prepacked bullet entirely — make_qmoe_op does not set that attribute and it has no presence in the codebase.

The weights_prepacked attribute is new; it was added in the latest onnxruntime: microsoft/onnxruntime#28749.


Add a skill for new model

12648be

Copilot AI review requested due to automatic review settings June 9, 2026 23:13

Copilot started reviewing on behalf of tianleiwu June 9, 2026 23:14

Copilot AI reviewed Jun 9, 2026

Copilot started work on behalf of tianleiwu June 10, 2026 01:03

Fix SKILL.md MoE/QMoE section: remove inaccurate guidance, fix block_size claim

45b236c

Copilot finished work on behalf of tianleiwu June 10, 2026 01:08

tianleiwu marked this pull request as draft June 10, 2026 01:12

tianleiwu wants to merge 2 commits into main from tlwu/new-model-skills


3 participants
